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DISCUSSION: ONE-STEP SPARSE ESTIMATES 
IN NONCONCAVE PENALIZED LIKELIHOOD MODELS: 
WHO CARES IF IT IS A WHITE CAT OR A BLACK CAT?i 

By Xiao-Li Meng 
Harvard University 

1. An insider's minor comments. Section 2.3 seems to be the reason 
that I am a discussant. There, it was first stated that the proposed LLA 
algorithm is an instance of the MM algorithm, as termed by Lange, Hunter 
and Yang [8]. Then it was shown, under certain conditions, that it is also 
an EM algorithm. My initial reaction was "hmmm, the authors' reading of 
Lange, Hunter and Yang [8] must have ceased before reaching its discus- 
sions," because a more general "MM = EM" result using the same Laplace 
transform technique was the base for a key inquiry of Meng [10], a discussion 
of Lange, Hunter and Yang [8]. Upon a more careful reading, I realized that 
the authors' construction, though mathematically equivalent to mine, gives 
a different interpretation to the constructed missing data/latent variable. 
This is rather interesting, especially if my initial reaction was correct. 

For the current paper, this "MM = EM" result appears to be of minor 
interest, especially because its potential benefit is not explored in the paper. 
Instead, the only punch line seems to be a logically unsubstantiated one: 
"Thus, Theorem 3 also indicates that MM algorithms are more flexible than 
EM algorithms." To clarify these issues, which seems to be what I have been 
asked to do, I'll focus my discussion first on this algorithmic connection and 
qualify the statements that I just made. I will then pose some questions going 
beyond those algorithmic and computational considerations. (The two parts 
are connected via the "cat" title, for those who are patient enough to read 
both.) 

To do so most effectively, let me invoke an author's privilege to reproduce 
Section 2 of Meng [10] in its entirety, to map out its mathematical connec- 
tions with Section 2.3 of the current paper. The acronym "SM" below was 
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my suggestion because the general recipe Lange, Hunter and Yang [8] put 
forward consists of a "Surrogate step" and a "Maximization/Minimization 
step"; in their rejoinder, Lange, Hunter and Yang [8] coined the term "MM," 
partiahy in fear of the association of "SM" with "S&M." The "alphabet 
soup" phrase was used by Lange, Hunter and Yang [8] to describe the col- 
lection of acronyms in the EM literature; indeed, for readers who enjoy 
collecting acronyms, the "GAECM" below is another treasure to hunt! 

2. Is SM just EM? Of course, the new alphabet soup will not be a really new 
delight if it is just the old soup presented in a new, perhaps larger, bowl. Could 
it be that SM, though apparently more general, is just a disguised or beautified 
version of EM? The answer is not completely obvious, especially if one starts 
the comparison with the most obvious construction of the surrogate func- 
tion via linear minorization/majorization. As in the authors' equation (3.1) 
(page 9), assume our log-likelihood function L{6\y) can be written as 

(2.1) my) = fy{9)~gy{9), eeQcR\ 

where both fy and Qy are concave functions and without loss of generality 
[when |gj;(0)| < oo] we assume gy{0) = for all y. Now suppose e"**'-"-' is the 
moment generating function of a conditional density h{z\y), namely, 

(2.2) e-'"^^"^ = j e''h[z\y)ii{dz), BeQ. 

Then if we augment p{y\0) = e^'"'"* to 

(2.3) p(z|y,%(y|e)^[e«^+««Wft(z|y)][e^(«l^)]=e^«(«)+«^Mz|y), 
we have, for the standard EM construction, 

(2.4) Q{e\e^'^) = fy{e) + enz\y,e'^'^)+F.[\ogh{z\y)\y,e^\ 

But this is equivalent to the proposed linear minorization surrogate function 

(2.5) Q{e\e^'^) = fy{e)-g'y{e^'^){e~e'^''^), 

because E(Z|t/,&) = —g'y{0) from differentiating both sides of (2.2). Inciden- 
tally, by difi'erentiating both sides of (2.2) twice, we have gy{0) = — V(Z|j/, 6) < 
0, and thus the concavity of g{6) is a necessary condition for this EM con- 
struction to be possible. [For multivariate 9 we can construct the missing data 
Z with the same dimension and replace 9z in (2.2) with 6^ z.] 

Although this EM construction is not always possible (e.g., when e"^"'^' 
may not be a moment generating function), and even when it is possible it 
requires more brain power than the linear minorization method, it neverthe- 
less suggests that a large class of SM algorithms based on (2.5) are also EM 
algorithms with augmentation p{z,y\9) of (2.3). 

So the question is, given Q{9\(l>) from a particular SM construction, how 
do we know if there is a corresponding EM construction, regardless of how 
convoluted the latter might be? The practical relevance of this theoretically 
interesting question is that, if the EM class is as rich as the SM class, then 
the value of the new SM formulation is in providing a set of new tools for 
creative EM-type implementation. However, if the SM class is richer than the 
EM class, then it provides hope for solving problems that are difficult or even 
impossible to solve within the entire GAECM framework. 
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To apply the above result to the setting of authors' (2.1), we simply set 
= /3 and let fyi9) = j:7=iUfi), 9yiO) = nYfj=iPXjiPj)- Then the authors' 
(2.1) can be expressed as my (2.1) if we replace the 9 = f3 in gy{0) by = 
(|/?i|, . . . , |/3p|)'''. The authors' equivalence result then follows if we make the 
same replacement of 9 by \9\ in my augmentation (2.3) above, that is, if we 
augment p{y\9) = e^^^^^^^ to 

(1.1) p{z\y,9)p{y\9) = [el^l^"+3«(l^')/i(z|y)] [e^^^^'^)] = e-^^(^)+l^l^"/i(z|y). 

A key difference between the authors' construction and mine is that in the 
authors' setting, the augmented variable — or auxiliary variable, as known in 
statistical physics (but AV in either case!) — r is a hyper-parameter, while 
in my construction, the AV z represents missing data. Although I believe 
that the authors' reciprocal transformation, that is, taking r oc is un- 
necessary and only mystifies the construction, the authors' hyper-parameter 
interpretation further helps to emphasize the key question raised in the last 
paragraph of the quoted Section 2 above. That is, given an MM (a.k.a. SM) 
algorithm, how do we know it is not an EM algorithm, considering there are 
infinitely many ways to construct AVs? Just because one particular construc- 
tion fails to reproduce an MM algorithm as an EM algorithm, as in authors' 
Theorem 3, logically it says nothing about whether this MM algorithm can 
be reproduced as an EM algorithm by a different AV construction. 

In fact, the identification of whether an MM algorithm is also an EM 
algorithm (the reverse of course is always true) is an unexpectedly difficult 
problem. In Meng [10], I gave a sufficient and necessary condition for an MM 
to be an EM. Verifying this condition turns out to be the same as determin- 
ing when a normalized yoke is also an expected likelihood yoke, a very chal- 
lenging problem in differential geometry. Any reader who is curious about 
the notion of yoke, or wants to take on this challenge, is invited to spend 
an intensive evening with me via Meng [10], with, of course, guests such as 
Barndorff-Nielsen [2], Barndorff-Nielsen and Jupp [3] and Blaesild [4] — there 
could be an AOS-worthy paper here! 

I surmise most readers of the current paper would prefer to spend an 
evening with someone else. Indeed, I am a bit curious myself about the 
connection of this equivalence result to the rest of the paper. "Who cares 
if it is a white cat or black cat, as long it catches mice," so goes a famous 
quotation attributed to Deng Xiaoping (allegedly the statement got him into 
trouble before he regained his power in China). So who cares if it is an MM 
or EM, as long as it finds the right solution? One answer is that once we 
know an algorithm is an EM algorithm, we can take advantage of the vast 
literature on the EM algorithm to study its theoretical properties or improve 
its speed by applying techniques designed for EM-type of algorithms. I can 
certainly think of the benefit of deriving its theoretical rate of convergence 
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along the lines of Meng and Rubin [11, 12] or Meng [9], or trying to speed it 
up by attempting the recipes given in Meng and van Dyk [13, 14], either of 
which could lead to useful insights or improvement of LLA. Seeing no such 
discussions in this paper, I was almost tempted to self-indulge the possibility 
that this MM-EM equivalence issue was included because the authors wanted 
me to be a discussant (circumstantial evidences: [10] is the only general 
investigation of the equivalence result, to the best of my knowledge). Of 
course, my ego-checking side has to wonder: where is the beef (or mouse!)? 

2. An outsider's major questions. Having done my insider's job but 
failed to identify a mouse, let me now try to catch a bird by going outside! 
As an outsider (to penalized likelihood), I gather I can enjoy the freedom 
of asking naive questions and rely on the authors' expertise to correct any 
"mis-information" my inquiries might have created for the readers. 

Q.l. Should we care about the difference between penalized likelihood methods 
and Bayesian methods? 

To many Bayesians, and perhaps some semi-Bayesian or even non-Bayesians, 
using penalized likelihood is an attempt of enjoying the Bayesian fruits with- 
out paying the B-club fee. Although some pena-likelihoodists might be of- 
fended by this analogy, many of us perhaps are of the "who cares?" opinion 
here again: who cares if it has a PL-label or B-label, as long as it gets the 
job done? Indeed, I had been one of those until I gave the question a bit 
more thought when I was studying the literature on wavelet thresholding, a 
form of variable selection. 

The central question here is: are the two methods as general and flexi- 
ble as each other, so the difference is only in perception not in operation? 
On the surface they appear to be: just equate the penalty function term 
with the log of the prior density — the fact that Bayesian methods allow 
improper prior densities seems to make this equivalence almost completely 
general. However, in the Bayesian framework, once a posterior distribution 
is obtained, there is more than one way to utilize it; in the pena-likelihood 
world, maximizing the likelihood seems to be the only general recipe. 

A main focus of the current paper is about finding efficient and practical 
algorithms for maximizing the penalized log-likelihood in the form of au- 
thors' (2.1). Various approximations were made to the penalty function to 
overcome technical difficulties. When confined in the pena-likelihood frame- 
work, this seems to be the only game in town, as least from an outsider's 
perspective. When recast in the Bayesian framework, at least three addi- 
tional considerations come to my mind, however. 

First, considering the penalty function as the log of the prior density, it 
seems to me more fruitful to regard each approximation in its own right as a 
prior specification, rather than an approximation to an "ideal" penalty func- 
tion or log-prior. Evidently, LQA can be viewed as using a locally-adaptive 
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Gaussian prior, and LLA a locally-adaptive Laplace prior. Viewed in this 
way, it is almost inevitable to wonder what will happen if we adopt a locally- 
adaptive t-distribution as our prior or some other families of distributions? 
Even for people who simply do not want the B-club membership, these links 
can potentially provide new approximations to penalty functions that are 
otherwise hard to come by. For example, using mixtures as prior distribu- 
tions is a common practice in the Bayesian world, but it may not come as 
naturally when one is in the penalty function mood, though the authors' 
(2.2) hints at that direction. 

Second, a key advantage of using an penalty appears to be its ability 
to ensure sparse representation. From a Bayesian perspective, a sparse rep- 
resentation means that a priori we should put nontrivial mass at (5j = for 
any j (since a priori we do not know which one is zero). This leads natu- 
rally to the mixture prior specifications, such as those used in Abramovich, 
Sapatinas and Silverman [1] and Johnston and Silverman [5]. The need for 
nontrivial mass at zero seems to offer an alternative insight for the nondif- 
ferentiability at zero of the SCAD penalty family, as the authors discussed. 
Indeed, only thinking in terms of prior specifications allowed me, as an out- 
sider, to understand intuitively why the optimal penalty function has to be 
nondifferentiable at the origin. 

Third, once a posterior distribution is in place, we can consider several 
point-estimate summaries other than its mode (i.e., corresponding to MLE). 
Indeed, as demonstrated in Abramovich, Sapatinas and Silverman [1], if we 
use the posterior median, then we can also ensure sparse representation un- 
der the mixture priors mentioned above. To see this most clearly in general, 
suppose our prior for a (scalar) regression coefficient (3j is a mixture of an 
atom at Pj = and a continuous component, whose cumulative distribution 
function (CDF) is given by Fj{Pj), j = 1, . . . ,p. Under the assumption of a 
priori independence, the joint prior CDF for . . . ,/3p} is given by 

(2.1) F(/3i, = n[^il(/3.>o) + (1 - ^j)F,mi 

1=1 

where tTj is the prior mass on Pj =0, governing our desired degree of sparsity 
(e.g., setting ttj = 0.9 for all j means that we a priori believe that only about 
10% of the P coefficients should be significant). 

Under this setting, it is easy to verify that the marginal posterior dis- 
tribution for any Pj is also a mixture of an atom at zero and a continuous 
component: 

(2.2) F,(/3,-|y) =7ry,,l(^^.>o) + (1 - ^yj)F,,,-(/?, |y). 

The exact expressions for vTyj and Fcj{Pj\y) follow directly from the Bayes 
theorem, and they are the simplest when the likelihood L{Pi, . . . , Pp) factors 
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into JJj Lj{Pj), as in the wavelet applications of Abramovich, Sapatinas and 
Silverman [1] and Johnston and Silverman [5], in which case the . . . , f3p} 
are a posteriori independent. But regardless of this independence, as long 
as we decide to threshold each coefficient separately (in contrast to "block 
thresholding"), which seems to be the case for the current paper, we only 
need to deal with the marginal distribution as given in (2.2). We therefore, 
for notation simplicity, will drop the subscript j in the following discussion. 
(For a "true" Bayesian such a marginal approach can also be acceptable 
when the dependence is not too strong and "joint thresholding" is computa- 
tionally too expensive to be practical. But even in such cases the Bayesian 
thinking clearly helps us to understand what corners we have cut, thereby 
constructively pointing us to directions for improvement when more efficient 
computational tools become available.) 

Given the setting above, it is trivial to see that if the posterior mass 
TTy = Pr(/3 = 0|y) > 1/2, then Pr(/3 < 0|y) > 1/2 and Pr(/3 < 0) < 1/2. Re- 
call that the median of a (right-continuous) CDF F{x) is the (or any) value 
M such that 



Consequently, whenever vTy > 1/2, the posterior median of /3, denoted by 
Med(/3|y), must be exactly zero, provided that the continuous component 
Fc has no "flat" region, a condition we assume. [Without this assumption, 
Med(/?|y) may not be unique, a technical complication that is of little inter- 
est for almost any practical purposes.] In other words, the median procedure 
will exclude a coefficient Pj if the posterior lends stronger (or no weaker) 
support to Pj = than to Pj 7^ 0, a very intuitive and appealing requirement 
for sparsity to occur in our inference. 

Given that vTy > 1/2 is a sufficient condition for declaring sparsity under 
the posterior median approach, it is natural to ask if it is also a necessary 
condition. That is, is it possible for Med(/3|y) to be exactly zero when TTy < 
1/2? The answer is positive, as demonstrated in Abramovich, Sapatinas 
and Silverman [1]. This might appear to be somewhat counterintuitive at 
first sight, because vTy < 1/2 seems to imply that the hypothesis /3 = is 
less supported by the data than the hypothesis /3 7^ is. There is a subtle 
distinction here, however. A lack of support for the atom component of the 
mixture distribution is not the same as a lack of support for the value of /3 to 
be zero. The continuous part of the posterior distribution, namely, Fc(/3|y), 
can also provide evidence for P = 0, even if it was "designed" for inference for 
P away from zero. This perhaps can be best seen in the extreme case where 
we set TT = 0, that is, we do not believe sparsity a priori and, hence, we do 
not even allow the atom component to show up in our model. Nevertheless, 
our posterior inference can still provide good evidence for P = 0, or more 



(2.3) 



F(M) > i but F(A/„) = lim F(x) < i 
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precisely, for /? being in a small neighborhood of /? = 0, if, for example, our 
posterior density ends up like a normal density with mean (near) zero. 

This observation suggests that even when vTy < 1/2, there should be a 
threshold (< 1/2) for TTy above which Med(/?|y) would still be zero because 
the additional evidence for /3 = from the continuous component Fc{/3\y) is 
able to compensate the degree of failing of reaching vTy > 1/2 by the atom 
component. To see this clearly, let Medc(/3|y) be the posterior median from 
Fc{l3\y). If Medc(/3|y) > 0, then for any /? < 0, 

(2.4) F{P\y) = (1 - 7ry)F,(/?|y) < < \. 

Hence, Med(/3|y) must be nonnegative. This could let us to declare that 
Med(/?|y) must be the solution of 

(2.5) F(/?|y) = VTy + (1 - 7ry)F,(/3|y) = i, 

solving which would yield Med(/3|y) = F^^{^—^\y), where Oy = 7ry/(l — 
TTy) is the posterior odds for /3 = 0. One has to realize, however, that although 
equation (2.5) always has a unique solution whenever Oy < 1, this solution is 
nonnegative if and only if Oy < 1 — 2Fc{0\y), which is a nonempty condition 
because Fc{0\y) < 1/2 under our assumption that Medc(/3|y) > 0. When 
this condition fails, clearly F(0|y) > 1/2, and hence, Med(/3|y) = 0, because 
F(0_|y) < 1/2, a consequence of (2.4). 

Similarly, when Medc(/3|y) < 0, then for any /3 > 0, 

(2.6) F(/3_|y) = TTy + (1 - 7Ty)FMy) > % + |(1 " %) > h 
Therefore, Med(/3|y) must be nonpositive. This means that Med(/3|y) is 
the solution of (1 — 7ry)Fc(/J|y) = 1/2, when the solution F~^{ ^~^^ ^ |y) < 0, 
which holds if and only if Oy < 2Fc{0\y) — 1. When this condition fails, 
F(0_[y) < 1/2, and hence, Med(/3|y) = because (2.6) implies F(0) > 1/2. 

Finally, when Medc(/3|y) = 0, 

i^(0-|y) = ^^<^ and F(0|y) = i±^>i, 

hence, Med(/3|y) = 0. Summarizing these derivations yields the following 
general result, which provides a theoretically more appealing form to gain 
insights than detailed expressions derived for specific models, such as the 
formulae in Abramovich, Sapatinas and Silverman [1] under normality. 

Lemma 1. Assume the continuous component Fc{(3\y) in the posterior 
mixture is strictly monotone, and let Sc,y = Sign{Medc(/3|y)}, the sign of 
Medc(/3|y). Then the posterior median of (3 has the following analytic ex- 
pression, where Oy is the posterior odds for (3 = 0: 

(2.7) Med(fly) = |^=-'(^^4^k)- if Oy < H - 2FMy)l. 

(O, !/O,>|l-2F,(0|y)l- 
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This result, of course, is elementary (so no credit is claimed!), but the 
insights deriving from it are less so. First, the quantity Ac(y) = |1 — 2i<'c(0|y)| 
provides the threshold we were looking for, that is, the smallest value of Oy 
such that if Oy takes any value below it, then the evidence for /3 = from the 
continuous component can no longer compensate for the fact that Oy < 1 
[note that Oy > Ac(y) holds trivially when Oy > 1]. Because 

Ae(y) = I PiciP > 0|y) - PrciP < 0|y)|, 

where the probability calculation Pre is with respect to Fc{-\y), we see clearly 
that the closer Medc(/3|y) is to zero — and hence the more "borrowed" ev- 
idence for supporting /3 = — the more tolerant the median approach be- 
comes in setting the lower bound on the permissible value of Oy for hard- 
thresholding p. Indeed, in the extreme case when Medc(/3|y) is exactly zero, 
the posterior median approach will set (5 = regardless of the value of Oy. 
This makes sense because Medc(/3|y) = is the strongest evidence for /3 = 
from the continuous component under the norm (and when only point 
estimators are considered). 

Second, even when Oy < Ac(y), there is still some evidence for /? = 0, 
and the posterior median approach directly takes into account the strength 
of this evidence by shrinking Medc(/3|y) = -^c~^(lly) toward zero according 
to the value of Oy. By Lemma 1, Med(/3|y) is always closer to zero than 
Medc(/3|y), and how much closer is directly determined by Oy. Third, this 
shrinkage is actually a double shrinkage, representing a sensible form of "soft 
thresholding." This is perhaps most clearly seen when Fc{P\y) is given by a 
symmetric location-scale family (e.g., normal): Q(^^iiM^ASiyl'^^ where o"c,y is 
the posterior scale, measuring the uncertainty in the continuous component 
of the posterior distribution [and in Medc(/3|y)] for inferring f3. In such cases, 
Lemma 1 implies that, when Oy < Ac(y) [and hence Medc(/3|y) ^ 0], 

f MedMy) - <^c,yG~^ (^-^) ' if Med,(/3|y) > 0; 
Med(/?|y)= )l + 0\ 

[ Med,(/3|y) + ac,yG-' [~^^) ' Med,(/3|y) < 0. 

(2.8) 

The "double shrinkage" is seen here because Medc(/?|y) is often a shrink- 
age estimator by itself (e.g., in the normal case and when the prior mean 
for Fc is set to zero, as typical in applications). But as long as Oy > and 
<7c,y > (always true in practice), Med(/?|y) further shrinks toward zero. 
And the larger (Tc,y or Oy, the greater the shrinkage. This makes perfect 
sense, as larger (Tc,y implies Medc(/?|y) to be less trustworthy and, hence, 
more evidence for /? = 0; similarly and in fact more directly, the larger Oy, 
the more evidence of shrinkage toward zero [indeed, when Oy exceeds Ac(y), 
we will shrink Medc(/?|y) all the way to zero!]. 
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I provide this rather detailed derivation and discussion to demonstrate 
that achieving sparsity via the posterior median under a mixture prior is 
conceptually (and often computationally) more appealing than optimizing 
a penalized likelihood. With the latter approach, one appears to be using 
a black box, as determined by an optimization algorithm, much more so 
than with the former, where every step helps to gain probabilistic insights. 
Furthermore, one wonders if the aspect should go directly into the model, 
as with the pena-likelihood approach, or go into the loss function, as with the 
Bayesian approach (recall that median is the optimal estimator under the 
loss). But these concerns by no means undermine the value of the MLE 
approach; indeed, in Kong, Meng, Nicolae and Tan [7] and Kong, Meng 
and Nicolae [6], we identified a situation where it is logically impossible 
to perform any Bayesian inference, but MLE offers a very useful solution. 
Rather, they serve as an invitation to the authors to explore alternative 
methods for achieving sparsity, as well as to include the posterior median 
approach in their list of methods for comparison. 

Q.2. Should we really care about being only one-step, not multi-step? 

The authors clearly took pride in their one-step LLA estimator, a pride 
many authors would share. It achieves sparsity, requires least computation, 
and is "as efficient as the fully iterative method, provided that the initial 
estimators are reasonably good." Indeed, the authors' simulation results 
show — if I understand their notation correctly — that it is even capable of 
outperforming the estimator from the fully iterative algorithm it supposedly 
approximates. So why would anyone still care about doing more than one 
step? 

I would, from both practical and theoretical points of view. Practically, 
the statement "provided that the initial estimators are reasonably good" 
worries me. Many methods work beautifully — at least one hopes — in the 
hands of experts, who have experiences, insight and the necessary skills to 
carry out whatever fine tuning is needed to achieve all the good properties 
as promised. But once it is "at large," a cute indoor pet might be too excited 
or too frightened to be as obedient as it is with its loving owner, especially 
when it sees a complete stranger. One advantage of the MM or EM- type 
algorithms, because of their ascent property, is their relative robustness to 
starting points, thereby reducing the burden of fine tuning for a novice. 
(Of course, this statement is true only when the objective function being 
maximized has a unique mode; but the multi-modality does not seem to 
be an issue addressed in the paper, and at any rate it is better handled 
by a full Bayesian method than by the MLE approach.) Other than for 
large simulation studies, the extra computation needed for going beyond one 
step or even for iterating until convergence is usually quite acceptable given 
current computing power. At any rate, we can view this extra computational 
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cost as an insurance premium for guarding against accidental "unreasonable" 
starting points, assuming, of course, the fully iterated results are what we 
are really after. 

This brings me to my second point. Theoretically, I am curious about 
how the ascent property helps to improve the performance of the fc-step 
estimator, as k increases. Or does it? If one considers higher (penalized) 
likelihood value as desirable, then of course the improvement is automatic. 
But does higher penalized likelihood value always transfer to better correct- 
fit rate? There is no evidence in the paper — or elsewhere as far as I am aware 
of — to suggest this. Quite to the contrary, the numerical evidence (e.g.. Table 
3) the authors provided seem to indicate that the opposite is more than a 
theoretical possibility, because there the one-step SCAD outperformed the 
fully iterated SCAD by a large margin. 

This outperformence is rather unsettling for me. If higher penalized like- 
lihood values do not necessarily transfer to better fitting proportion, which 
is one of the main criteria used, why then maximize the likelihood to start 
with? Could it be that the one-step LLA worked well not because it approx- 
imates the "ideal" full algorithm well, but rather it — perhaps by accident — 
corrects the defect in the full MLE? Actually, this is not the first time that 
such a phenomena has been noted. In Silverman, Jones, Wilson and Nychka 
[15], it was reported that stopping an EM iteration earlier can provide better 
image reconstructions than reaching the full MLE, due to the over-fitting 
tendency with the MLE approach, especially when there are many parame- 
ters involved. A remedy is to be more Bayesian: to rely on the entire posterior 
distribution rather than just its mode (even the use of the posterior median 
can help to reduce the over-fitting rate because of the robustness property of 
the median). One can even achieve this, at least partially, without paying the 
B-club membership; Silverman et al. [15] inserted a "smoothing step" into 
EM to regulate the over-fitting problem. I wonder if a similar idea could help 
to reduce the much higher over-fitting rates for the full SCAD, as compared 
to the one-step SCAD, in the authors' Table 3. Again, it would be useful to 
include the posterior median approach in the authors' comparisons, so we 
can have a better diagnosis as to whether the failure of the full SCAD is 
due to algorithmic inaccuracy or rather defects in the underlying principles 
or methods. 

3. Some ironies. As I am finishing this discussion, several ironies come 
to mind. First, usually a discussant is more critical of topics s/he has dived 
well into, and much more "generous" otherwise, in fear of making a fool of 
her/himself. I clearly have failed to follow this wisdom, as my "outsider's 
comments" touch upon more critical issues than my "insider's inspection" 
does. I hope that the authors will tolerate my adventurous style for wan- 
dering outside boundaries that I am comfortable with, and will take these 
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outsider's comments only as seriously as the authors think they deserve to 
be. 

Second, Section 2 of my discussion seems to complain loudly about the 
lack of Bayesian thinking, and practice, in the current paper or in much 
of the similar literature. But the authors' formulation of the "EM = MM" 
result is squarely Bayesian, as it invokes a prior and even a hyper-prior 
specification. In contrast, my original construction and derivation, as quoted 
in this discussion from Meng [10], is squarely likelihood-based. I want to 
mention this to emphasize the fact that — to echo the title — we are all multi- 
color cats and we all try to catch as many mice, or even rats, as we can. 

Last, although I have tried to take my discussant's role very seriously 
(and perhaps too seriously since I just couldn't "let it go" even when it was 
three months overdue), I have not done the one thing that most discussants 
would do first, that is, to congratulate the authors for a stimulating paper! 
So here it comes: Congratulations and Meow!!! 

4. Acknowledgments. I thank the Co-Editor, Jianqing Fan, for his kind 
invitation and his extraordinary patience in tolerating my tardiness in prepar- 
ing this discussion, and Paul Baines, Xuming He and Thomas Lee for helpful 
comments. 
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